Solving Data Sparsity by Morphology Injection in Factored SMT

Authors

  • Sreelekha S
  • Piyush Dungarwal
  • Pushpak Bhattacharyya
  • D. Malathi
Abstract

Statistical machine translation (SMT) approaches face the problem of data sparsity when translating into a morphologically rich language. A parallel corpus is very unlikely to contain all morphological forms of words. We propose a solution that generates these unseen morphological forms and injects them into the original training corpora. We observe that morphology injection improves translation quality in terms of both adequacy and fluency. We verify this with experiments on two morphologically rich languages, Hindi and Marathi, when translating from English.
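
The idea of generating unseen forms and injecting them into the training data can be illustrated with a minimal sketch. This is not the authors' implementation: the paradigm table, the rough romanization of a Hindi noun, the pipe-separated factor notation, and the function names `generate_forms` and `inject` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical paradigm table: (number, case) -> suffix for one noun class.
# Rough romanization of a Hindi masculine noun ending in -aa; illustrative only.
PARADIGM: Dict[Tuple[str, str], str] = {
    ("sg", "direct"): "aa",
    ("sg", "oblique"): "e",
    ("pl", "direct"): "e",
    ("pl", "oblique"): "on",
}


def generate_forms(src_lemma: str, tgt_stem: str) -> List[Tuple[str, str]]:
    """Pair a factored source token with every target inflection of the stem."""
    pairs = []
    for (number, case), suffix in PARADIGM.items():
        # Source side carries the factors that decide the target inflection
        # (pipe-separated notation is an assumption of this sketch).
        src = f"{src_lemma}|{number}|{case}"
        pairs.append((src, tgt_stem + suffix))
    return pairs


def inject(
    corpus: List[Tuple[str, str]], lexicon: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return the corpus extended with generated pairs it does not yet contain."""
    seen: Set[Tuple[str, str]] = set(corpus)
    extended = list(corpus)
    for src_lemma, tgt_stem in lexicon.items():
        for pair in generate_forms(src_lemma, tgt_stem):
            if pair not in seen:
                seen.add(pair)
                extended.append(pair)
    return extended


# Toy corpus that has seen only one form of the noun.
corpus = [("boy|sg|direct", "ladkaa")]
lexicon = {"boy": "ladk"}  # hypothetical source lemma -> target stem mapping
extended = inject(corpus, lexicon)
```

In this toy setup the original pair is kept and the three missing inflections are appended, so a model trained on `extended` has seen every form in the paradigm at least once.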

Similar resources

Morphology In Statistical Machine Translation From English To Highly Inflectional Language

In this paper, we investigate the role of morphology in phrase-based statistical machine translation (SMT) from English to the highly inflectional Slovenian language. Translation to an inflectional language is a challenging task because of its morphological complexity. Rich morphology increases data sparsity and worsens the quality of statistical machine translation. The idea of the paper is to...

English-Latvian SMT: knowledge or data?

When phrase-based statistical machine translation (SMT) is applied to languages with rather free word order and rich morphology, translated texts are often not fluent due to misused inflectional forms and wrong word order between phrases or even inside a phrase. One possible way to improve translation quality is to apply factored models. The paper presents work on Englis...

Morphology Generation for Statistical Machine Translation

When translating into morphologically rich languages, statistical MT approaches face the problem of data sparsity. The sparseness problem is more severe when the corpus for the morphologically richer language is small. Even though we can use factored models to correctly generate morphological forms of words, the problem of data sparseness limits their performance. In this paper, we...

Addressing some Issues of Data Sparsity towards Improving English-Manipuri SMT using Morphological Information

The performance of an SMT system heavily depends on the availability of large parallel corpora. Unavailability of these resources in the required amount for many language pairs is a challenging issue. The required size of the resource involving a morphologically rich and highly agglutinative language is essentially much larger for SMT systems. This paper investigates on some of the issues on en...

Addressing Problems across Linguistic Levels in SMT: Combining Approaches to Model Morphology, Syntax and Lexical Choice

Morphological complexity
  • Data sparsity due to uncovered inflected forms
  • Difficulty to produce the correct target-side inflection based on available information
COMBINING APPROACHES
  • Pre-processing – syntactic level: Source-side reordering (Gojun and Fraser, 2012)
  • At decoding time – lexical level: Discriminative classifier to score translation rules using source-side context (Tamchyna et al...

Publication date: 2015